Use specialized dictionary kernels (#1178) #2808

tustvold · 2022-06-28T18:16:57Z

Which issue does this PR close?

Part of #1178

Rationale for this change

arrow-rs supports native evaluation on dictionaries for comparison operations against other dictionaries and scalars. We should make use of this to avoid hydrating dictionaries unnecessarily.
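To illustrate why skipping hydration helps, here is a minimal std-only sketch. `DictColumn`, `eq_scalar_hydrated` and `eq_scalar_dict` are hypothetical names for this illustration, not arrow-rs APIs, and nulls are ignored. The dictionary-aware version compares each distinct value once and then maps the keys through the result, rather than materializing a string per row.

```rust
/// Simplified stand-in for a dictionary-encoded string column:
/// row `i` holds `values[keys[i]]`. This is NOT the arrow-rs `DictionaryArray`.
pub struct DictColumn {
    pub keys: Vec<usize>,
    pub values: Vec<String>,
}

/// Baseline: hydrate every row into an owned string, then compare row by row.
pub fn eq_scalar_hydrated(col: &DictColumn, scalar: &str) -> Vec<bool> {
    let hydrated: Vec<String> = col.keys.iter().map(|&k| col.values[k].clone()).collect();
    hydrated.iter().map(|v| v.as_str() == scalar).collect()
}

/// Dictionary-aware: compare each distinct value once, then look the
/// result up per key. No per-row string is ever materialized.
pub fn eq_scalar_dict(col: &DictColumn, scalar: &str) -> Vec<bool> {
    let matches: Vec<bool> = col.values.iter().map(|v| v.as_str() == scalar).collect();
    col.keys.iter().map(|&k| matches[k]).collect()
}
```

Both functions return the same mask; the second does one string comparison per distinct value instead of one per row, which is where a low-cardinality column would be expected to gain the most.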

What changes are included in this PR?

Tweaks the coercion rules to coerce to the dictionary type when the operator supports it.
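As a rough sketch of what such a rule might look like (not the actual code in `binary_rule.rs`): the `DataType` and `Operator` enums below are cut-down stand-ins, and `supports_dictionary` and `coerce` are hypothetical helpers. Which operators count as dictionary-capable is an assumption here; the sketch treats `Eq` and `Lt` as supported and `Like` as not.

```rust
/// Cut-down stand-in for Arrow's `DataType` (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Cut-down stand-in for a binary operator (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operator {
    Eq,
    Lt,
    Like,
}

/// Assumption: only comparisons have dictionary-aware kernels in this sketch.
fn supports_dictionary(op: Operator) -> bool {
    matches!(op, Operator::Eq | Operator::Lt)
}

/// Pick the common type for `lhs op rhs`, preferring the dictionary type
/// when the operator can evaluate dictionaries directly.
pub fn coerce(lhs: &DataType, rhs: &DataType, op: Operator) -> Option<DataType> {
    if lhs == rhs {
        return Some(lhs.clone());
    }
    // Find the dictionary side, its value type, and the other side.
    let (dict, value, other) = match (lhs, rhs) {
        (DataType::Dictionary(_, v), o) => (lhs, &**v, o),
        (o, DataType::Dictionary(_, v)) => (rhs, &**v, o),
        _ => return None,
    };
    if value != other {
        return None;
    }
    if supports_dictionary(op) {
        Some(dict.clone()) // keep the column dictionary-encoded
    } else {
        Some(other.clone()) // fall back to the hydrated value type
    }
}
```

Under these assumptions, `coerce(&dict, &DataType::Utf8, Operator::Eq)` keeps the dictionary type, while the same call with `Operator::Like` falls back to `Utf8`.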

Are there any user-facing changes?

No

tustvold · 2022-06-28T18:18:20Z

datafusion/physical-expr/src/expressions/binary.rs

+        match (&left_value, &left_data_type, &right_value, &right_data_type) {
+            // Types are equal => valid
+            (_, l, _, r) if l == r => {}
+            // Allow comparing a dictionary value with its corresponding scalar value


This is actually necessary for correctness in addition to being beneficial for performance, because ScalarValue does not have a way to encode a dictionary data type.
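A self-contained sketch of the kind of check the diff above hints at, under stated assumptions: `Operand`, the cut-down `DataType` and `check_operand_types` are hypothetical, and accepting the scalar on either side is a guess, since the full match is not shown in the diff. Because a scalar here cannot carry a dictionary type, a dictionary array paired with a scalar of its value type is treated as valid even though the two data types differ.

```rust
/// Cut-down stand-in for Arrow's `DataType` (illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Int32,
    Utf8,
    /// Dictionary(key type, value type)
    Dictionary(Box<DataType>, Box<DataType>),
}

/// Whether an operand evaluated to an array or to a single scalar.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Operand {
    Array,
    Scalar,
}

/// Validate the operand types of a binary expression before evaluation.
pub fn check_operand_types(
    left: Operand,
    left_type: &DataType,
    right: Operand,
    right_type: &DataType,
) -> Result<(), String> {
    match (left, left_type, right, right_type) {
        // Types are equal => valid
        (_, l, _, r) if l == r => Ok(()),
        // A dictionary array against a scalar of its value type => valid
        (Operand::Array, DataType::Dictionary(_, value), Operand::Scalar, scalar_type)
        | (Operand::Scalar, scalar_type, Operand::Array, DataType::Dictionary(_, value))
            if **value == *scalar_type =>
        {
            Ok(())
        }
        _ => Err(format!(
            "cannot evaluate binary expression on {:?} and {:?}",
            left_type, right_type
        )),
    }
}
```

In this sketch an array of type `Utf8` against a dictionary array is still rejected, on the assumption that coercion should have aligned the two array types beforehand.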

tustvold · 2022-06-28T18:23:35Z

Unsurprisingly, the performance benefits of this are quite pronounced:

scheduled: select count(*) from t where dict_10_required = 'prefix#0'                                                                             
                        time:   [4.0683 ms 4.0732 ms 4.0783 ms]
                        change: [-40.646% -40.476% -40.299%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

tokio: select count(*) from t where dict_100_required = 'prefix#0'                                                                             
                        time:   [4.4917 ms 4.5479 ms 4.6022 ms]
                        change: [-39.027% -38.266% -37.502%] (p = 0.00 < 0.05)
                        Performance has improved.

scheduled: select count(*) from t where dict_100_required = 'prefix#0'                                                                             
                        time:   [3.8694 ms 3.8755 ms 3.8815 ms]
                        change: [-33.176% -32.985% -32.795%] (p = 0.00 < 0.05)
                        Performance has improved.

tokio: select count(*) from t where dict_1000_required = 'prefix#0'                                                                             
                        time:   [4.7944 ms 4.8326 ms 4.8687 ms]
                        change: [-31.344% -30.719% -30.083%] (p = 0.00 < 0.05)
                        Performance has improved.

codecov-commenter · 2022-06-28T19:13:37Z

Codecov Report

Merging #2808 (1513b88) into master (7617d78) will increase coverage by 0.11%.
The diff coverage is 96.15%.

@@            Coverage Diff             @@
##           master    #2808      +/-   ##
==========================================
+ Coverage   85.11%   85.22%   +0.11%     
==========================================
  Files         273      274       +1     
  Lines       48242    48634     +392     
==========================================
+ Hits        41059    41449     +390     
- Misses       7183     7185       +2

Impacted Files	Coverage Δ
datafusion/physical-expr/src/expressions/binary.rs	`95.12% <75.00%> (+<0.01%)`	⬆️
datafusion/expr/src/binary_rule.rs	`84.76% <100.00%> (+0.47%)`	⬆️
datafusion/core/src/physical_plan/join_utils.rs	`93.61% <0.00%> (-3.20%)`	⬇️
datafusion/sql/src/planner.rs	`81.36% <0.00%> (ø)`
...fusion/optimizer/src/single_distinct_to_groupby.rs	`98.80% <0.00%> (ø)`
...on/core/src/physical_optimizer/coalesce_batches.rs	`100.00% <0.00%> (ø)`
datafusion/optimizer/src/reduce_outer_join.rs	`99.39% <0.00%> (ø)`
datafusion/core/tests/sql/joins.rs	`99.31% <0.00%> (+0.20%)`	⬆️
datafusion/expr/src/logical_plan/plan.rs	`74.40% <0.00%> (+0.29%)`	⬆️
datafusion/core/src/config.rs	`90.76% <0.00%> (+0.44%)`	⬆️
... and 3 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7617d78...1513b88.

alamb

Thank you @tustvold

alamb · 2022-06-30T00:26:25Z

datafusion/expr/src/binary_rule.rs

@@ -155,14 +155,12 @@ pub fn comparison_eq_coercion(
    lhs_type: &DataType,
    rhs_type: &DataType,
 ) -> Option<DataType> {
-    // can't compare dictionaries directly due to
-    // https://github.com/apache/arrow-rs/issues/1201


Use specialized dictionary kernels (apache#1178)

3858c95

github-actions bot added logical-expr Logical plan and expressions physical-expr Physical Expressions labels Jun 28, 2022

Fix tests

1513b88

tustvold marked this pull request as ready for review June 29, 2022 19:31

Dandandan approved these changes Jun 29, 2022

alamb approved these changes Jun 30, 2022

alamb merged commit 6e0bb84 into apache:master Jun 30, 2022

tustvold mentioned this pull request Jun 30, 2022

Support DictionaryArray in Like Kernels apache/arrow-rs#1975

Closed

tustvold mentioned this pull request Jul 12, 2022

Fix casts of ScalarValue::Utf8 to DictionaryArray #2875

Closed
